Supercomputer “Fugaku”, Formerly Known as Post-K
Fujitsu


■ “Fugaku” is named after Mt. Fuji
  ■ Highest mountain in Japan
  ■ Very broad gradual slopes


Co-design w/ application developers and Fujitsu-designed CPU core w/ 
high memory bandwidth utilizing HBM2 


Leading-edge Si-technology, Fujitsu's proven low power & high 
performance logic design, and power-controlling knobs 


Arm®v8-A ISA with Scalable Vector Extension ("SVE"), and Arm standard 
Linux 
Fujitsu-designed CPU Core w/ High Memory Bandwidth


■ A64FX out-of-order controls in cores, caches, and memories
  achieve superior throughput
CMG: 12x Compute Cores + 1x Assistant Core


512-bit wide SIMD


BW and calc. perf. (A64FX)

                              Peak     B/F ratio
DP floating perf. (TFlops)    2.7+     -
L1 data cache (TB/s)          11+      4
L2 cache (TB/s)               3.6+     1.3
Memory BW (GB/s)              1024     0.37
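
The second figure on the L1 and memory rows above appears to be a byte-per-flop (B/F) ratio, that is, peak bandwidth divided by peak double-precision throughput. A minimal sketch of that arithmetic follows. The function name `bytes_per_flop` is hypothetical, and the "+" lower bounds are treated as exact values, so the results come out slightly higher than the rounded slide figures.

```python
# Peak double-precision throughput from the slide ("2.7+" taken as 2.7).
PEAK_DP_TFLOPS = 2.7

# Peak bandwidths from the slide, expressed in TB/s.
bandwidth_tb_per_s = {
    "L1 data cache": 11.0,          # "11+" taken as 11.0
    "Memory (HBM2)": 1024 / 1000,   # 1024 GB/s converted to TB/s
}


def bytes_per_flop(bandwidth_tb_s, peak_tflops=PEAK_DP_TFLOPS):
    """Return the byte-per-flop (B/F) ratio for one level of the hierarchy."""
    return bandwidth_tb_s / peak_tflops


for level, bw in bandwidth_tb_per_s.items():
    print(f"{level}: {bytes_per_flop(bw):.2f} B/F")
```

A higher B/F ratio means more data can be delivered per arithmetic operation, which is why memory-bound kernels such as STREAM Triad benefit from the HBM2 bandwidth.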
A64FX Optimized Load Efficiency for Apps Performance


■ 128 bytes/cycle sustained bandwidth
  even for unaligned SIMD load


■ "Combined Gather" doubles gather
  (indirect) load's data throughput,
  when target elements are within a
  "128-byte aligned block" for a pair of
  two regs (even & odd)


Suggested through Co-design work w/ app teams


Maximizes BW to 32 bytes/cyc. 
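
The pairing condition for "Combined Gather" can be illustrated with a toy counting model. This is only a sketch under stated assumptions, not a description of the actual hardware: it assumes elements are paired as adjacent even/odd positions, and that a pair costs one load flow when both addresses fall in the same 128-byte aligned block and two flows otherwise. The function name `gather_flows` and the sample address lists are hypothetical.

```python
BLOCK_BYTES = 128  # size of the aligned block named on the slide


def gather_flows(byte_addresses, block_bytes=BLOCK_BYTES):
    """Count load flows for a gather under a simplified pairing model.

    Elements are taken in (even, odd) pairs. A pair whose two addresses
    fall in the same aligned block costs one flow; otherwise it costs two.
    """
    flows = 0
    for i in range(0, len(byte_addresses), 2):
        pair = byte_addresses[i:i + 2]
        # Number of distinct aligned blocks touched by this pair.
        blocks = {addr // block_bytes for addr in pair}
        flows += len(blocks)
    return flows


# Eight 8-byte elements; each pair stays inside one 128-byte block.
clustered = [0, 8, 256, 264, 512, 520, 1024, 1032]
# Eight 8-byte elements; the two members of each pair hit different blocks.
scattered = [0, 128, 256, 384, 512, 640, 768, 896]

print(gather_flows(clustered), gather_flows(scattered))
```

In this model the clustered index pattern needs half as many flows as the scattered one, which corresponds to the doubled gather throughput claimed above.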
A64FX Power Knobs to Reduce Power Consumption


■ “Power knob” limits units’ activities via user APIs
  ■ Performance/Watt can be optimized by utilizing Power knobs
A64FX Leading-edge Si-technology


■ TSMC 7nm FinFET & CoWoS
■ Broadcom SerDes, HBM I/O, and SRAMs
■ 8.786 billion transistors
■ 594 signal pins
“Fugaku” CPU Performance Evaluation (1/3)


■ Over 2.5x faster in HPC & AI benchmarks than SPARC64 XIfx


[Figure: A64FX chip performance measurements & architectural contributions.
Vertical axis: performance increase normalized by SPARC64 XIfx.
Benchmarks: DGEMM, STREAM Triad, Fluid dynamics, Atmosphere, Seismic,
Conv. FP32, Conv. INT8.
Contributions annotated on the bars: 512-bit SIMD, Memory, Combined gather,
L1 cache bandwidth, L2 cache bandwidth, INT8 partial dot product.]
“Fugaku” CPU Performance Evaluation (2/3)


■ Himeno Benchmark (Fortran90)
  ■ Stencil calculation to solve Poisson's equation by Jacobi method
[Figure: Himeno Benchmark results for Intel Xeon Platinum 8168 (2 CPUs),
Fugaku A64FX (1 CPU), SX-Aurora† (1 VE), and Tesla V100† (1 GPU).]

† "Performance evaluation of a vector supercomputer SX-aurora TSUBASA",
  SC18, https://dl.acm.org/citation.cfm?id=3291728
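
The kind of kernel the Himeno Benchmark exercises, a Jacobi stencil sweep for Poisson's equation, can be sketched as follows. This is a deliberately simplified 2D, 5-point version written for illustration and is not the benchmark's own 3D kernel. The function name `jacobi_poisson`, the grid size, the constant right-hand side, and the zero boundary values are all assumptions.

```python
def jacobi_poisson(f, h, iterations):
    """Run Jacobi sweeps for -laplacian(u) = f with u = 0 on the boundary.

    f is a square grid (list of lists) of right-hand-side values and
    h is the grid spacing. Boundary entries of u stay at zero.
    """
    n = len(f)
    u = [[0.0] * n for _ in range(n)]
    for _ in range(iterations):
        new = [row[:] for row in u]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                # 5-point stencil: each interior point is updated from its
                # four neighbours in the previous iterate.
                new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                    + u[i][j - 1] + u[i][j + 1]
                                    + h * h * f[i][j])
        u = new
    return u


n = 9                      # grid points per side, including the boundary
h = 1.0 / (n - 1)          # spacing on the unit square
f = [[1.0] * n for _ in range(n)]
u = jacobi_poisson(f, h, iterations=200)
print(f"u at grid centre: {u[n // 2][n // 2]:.4f}")
```

Every update reads several neighbouring values and performs only a few floating-point operations, so kernels of this shape tend to be limited by memory and cache bandwidth rather than by peak arithmetic throughput.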
“Fugaku” CPU Performance Evaluation (3/3)


■ WRF: Weather Research and Forecasting model
  ■ Vectorizing loops including IF-constructs is key optimization
  ■ Source code tuning using directives promotes compiler optimizations


WRF v3.8.1 (48-hour, 12km, CONUS) on 48 cores
1 / (Iteration time), normalized by Xeon

Intel Xeon Platinum 8168, 2 CPUs           1.00 (baseline)
Fugaku A64FX, 1 CPU (as is)                1.32
Fugaku A64FX, 1 CPU (w/ src tuning)        1.56
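
Why loops containing IF constructs matter for vectorization can be shown with a small sketch of if-conversion: the branch in the loop body is replaced by a per-element mask that selects between two computed results, which is the form a predicated SIMD instruction set can execute lane by lane. The example is purely illustrative and is not taken from WRF. The function names, the threshold, and the sample data are hypothetical, and Python is used only to show the transformation, not its performance.

```python
def branchy(values, threshold):
    """Loop with an IF construct: control flow differs per element."""
    out = []
    for v in values:
        if v > threshold:
            out.append(v * 2.0)
        else:
            out.append(v)
    return out


def predicated(values, threshold):
    """Same result with no branch in the loop body.

    Both candidate results are computed for every element, and a 0/1
    mask selects between them, mimicking a per-lane predicate.
    """
    out = []
    for v in values:
        mask = float(v > threshold)  # 1.0 where the condition holds, else 0.0
        out.append(mask * (v * 2.0) + (1.0 - mask) * v)
    return out


data = [0.5, 1.5, 2.5, -1.0]
print(branchy(data, 1.0) == predicated(data, 1.0))
```

Because every iteration of the second form executes the same instructions, a compiler can map it onto vector lanes. Directives in the source are one way to tell the compiler that such a transformation is safe.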
OSS Application Porting @ Arm HPC Users Group
(http://arm-hpc.gitlab.io/)


[Table: status cells read “Issues found” or “Modified”; the application names were not recoverable.]
Summary: “Fugaku” and Fujitsu Commercial Units


■ “Fugaku” is designed to run applications at the
  highest level of performance, to be worthy of the name


■ Arm HPC ecosystem and expanding apps portfolio
  are likened to the broad gradual slopes of Mt. Fuji


[Image of commercial unit from Fujitsu]


Fujitsu designed 48-core CPU


■ Fujitsu began production of
  “Fugaku” and is also advancing
  productization of commercial
  units based on the
  supercomputer technology


TSMC 7nm FinFET & CoWoS